ABSTRACT Over the last couple of decades, human fall detection has gained considerable
popularity, especially for the elderly. Elderly people need more attention as compared to 
others in their homes, hospitals, and care centers. Various solutions have been proposed to 
deal with this problem, yet many aspects of it are still unresolved. The current
study proposed an approach for human fall detection based on the Visual Geometry
Group (VGG-16) deep learning architecture. The presented approach was compared with
state-of-the-art approaches, including ResNet-50 and ResNet-101, on the MCF and URFD
datasets, and outperformed them with an accuracy of 98%. The proposed approach also
outperformed these deep architectures in terms of performance efficiency.


INDEX TERMS Convolutional Neural Network (CNN), deep learning, fall detection,
ResNet-50, ResNet-101, VGG-16
I. INTRODUCTION


The elderly population is growing at a rapid 
pace all over the world. For instance, in 
China, the elderly people constitute about 
10.1% of the total population and this 
figure is expected to rise to 35% by 2050 
[1]. Similarly, in Tunisia, the elderly
population is expected to increase to 17.7% 
of the total population by 2029 [2]. In 
Canada and Europe, most senior citizens 
prefer to stay at their homes [3]. Living
alone is in itself a huge threat to the health
and well-being of a person, irrespective of
their age. In addition, the absence of
caregivers also puts elderly people at a 
greater risk of injuries caused by accidents, 
especially falls. Every year in the USA, 
approximately one-third of its senior 
citizens, over the age of 65, suffer from fall- 
related injuries, resulting in an annual cost 
of around $31 billion [4]. Statistics on
falls in the US from 2007 to 2016 are
shown in Figure 1 [5]. In the UK, 70% of
the deaths
among elderly people, 75 years and above, 
are due to falls [6]. In recent years, with the 
advent of high-tech smart homes, new 
applications are being developed to 
improve the quality of life for the old and 
vulnerable group of people. One such 
active area of research is automatic fall 
detection. 


The economic impact of fall-related
injuries, especially in the elderly
population, is substantial. Recent research
[7] shows that in 2010, the total per capita
cost of fall-related injuries in the USA was
1,186 USD for people over the age of 75.
This was much higher than the per capita
costs for people belonging to other age
groups. For instance, for people younger
than 35, the costs were less than 300 USD.
Another report [8] shows that the total
number of people injured due to falls is
much higher in the elderly population. In
the 75+ age group, the total number of
cases was 456,000. In a younger
demographic group, for instance, the 35-44
year age group, the number of cases comes
down to 303,000. Figure 2 shows the
statistics of fall-related injuries
corresponding to various states of the US.
FIGURE 1. Distribution of Fall Death
Rates Per 100,000 in the US from 2007 to
2016


It is important to mention that the impact of 
such injuries is greater in elderly people 
since they take much longer to recover as 
compared to younger patients [6]. Overall, 
these figures show the gravity of the
problem. Fall detection becomes critical,
particularly when elderly people are alone,
which has been a growing trend in recent
years. Fall detection
technology can also help insurance 
companies to identify fraud cases. The slip 
and fall claims are the most common 
among staged accidents. This is because 
fall-related injuries are very hard to 
disprove and potential payouts can be very 
high [9]. The detection systems can help 
insurance companies to identify staged falls 
so that their money is spent only on 
deserving people. Timely recognition can
facilitate the immediate delivery of medical
care to the wounded.
According to the World Health
Organization (WHO), the number of
surgeries and the death rates from fall
incidents are among the major health
issues worldwide [6]. According to the
WHO, falls cause 37.3 million critical
injuries and 646,000 deaths each year,
leading to significant health and financial
issues worldwide.


FIGURE 2. Statistics Showing Fall
Injuries Corresponding to Various States of
the US


The cost of fall-related injuries varies by
age bracket. According to a US study of
hospital emergency departments [7], the
per-person cost of fall-related injuries in
the 18-24 age bracket was 238.24 USD,
while for elderly people aged 75 and above
it was much higher, around 1,186 USD.
This clearly shows a huge difference in the
cost of fall injuries between the young and
the elderly. The current study proposed a
deep learning algorithm to solve the
problem of human fall detection. It also
compared the proposed approach with
existing state-of-the-art deep architectures
in terms of accuracy and performance
efficiency.


The remainder of the paper is organized as
follows: Section II describes the related
work, while Section III explains the
proposed methodology and data sets.
Section IV presents various experiments
along with their analysis, while the
conclusion and future work are given in
Section V.


II. RELATED WORK


In 2015, Saiymon et al. developed a flexible
robot that relied on an edge detection-based
system to perceive human falls [10].
Microsoft Kinect and PC sensors were used
to assist people with their movements. This
procedure removed the limitations
associated with systems that relied on a
fixed sensor to detect human falls.
Chaccour et al. conducted an extensive
survey of various systems, from fall
detection to fall prevention, based on their
sensor deployment [11]. Their main aim
was to assist medical technologists in the
public health field with fall-related
systems.


In 2020, Singh et al. reviewed different
sensor technologies and their applications
in human fall detection [12]. They provided
a comprehensive technological insight into
existing fall detection systems, along with
a classification of the approaches and the
challenges encountered during
implementation. They classified fall
detectors into three categories, namely
ambiance-based, wearable, and hybrid
sensing detectors, which were then further
divided by sensor technology. They also
gave a comprehensive overview of the
competing sensor technologies used in
fall-detection systems, ranging from
accelerometers and radar to cameras and
pressure sensors.


In the same year, Ruzaij et al. proposed a
low-cost wearable fall detection and alarm
system designed to reduce false positive
alarms [13]. It monitored human movement
and, in case of a fall, sent a caution
message to the relevant authority.


In 2006, Miaou et al. proposed a fall 
detection system by using MapCam (Omni 
camera) [14]. They combined personal 
information with parameters, such as height 
and weight and achieved a classification 
accuracy of 79.8%. Without adding 
personal parameters, the accuracy came 
down to 68%.


Doukas et al. proposed a feature-based
human fall detection framework that
depends on the sound, video, and motion
captured from the patient's body [15].
Audio and video were used to detect
features such as frequency, acceleration,
speed, and closeness impact. Tracking
techniques were applied to a sequence of
frames, and audio data processing and
sound directionality, together with the
visual location and motion information of
the subject, were used to detect the fall.
The post-fall analysis, including the motion
behavior of the subject, was used to
compute the severity of the fall.


In 2013, Ali et al. proposed a feature-based
approach for human fall detection [16]. The
approach used various geometric, location,
and motion-based features, including
centroid, head position, aspect ratio, fall
angle, and motion vector, and achieved an
accuracy of 96.58% on the MCF dataset.
They then extended the problem and
proposed a faster and more accurate
real-time system under a boosting
framework [17]. In that approach, novel
temporal and spatial variance-based
features were proposed, comprising
geometric orientation, discriminatory
motion, and the location of the person. The
approach achieved an accuracy of 99% on
the Multiple Camera Fall (MCF) data set.


In 2016, Wang et al. proposed an approach
for human fall detection based on the
PCANet model [18]. The approach predicted
labels not only for individual frames but
also for the video sequence. They classified
the frames into three labels, namely
standing, falling, and fallen. Two linear
SVM models were trained, one for the
detection of a single frame and the other
for the detection of a video sequence. The
features extracted from PCANet were fed
to the linear SVMs. The approach obtained
a sensitivity and a specificity of 93.81%
and 98.4%, respectively, on the MCF
dataset.


In 2018, Min et al. proposed an approach
for human fall detection based on scene
analysis [19]. The scene analysis was
performed by Region-based Convolutional
Neural Networks (R-CNN) to compute the
spatial relationship between human beings
and furniture. Handcrafted features,
including the centroid, aspect ratio, and
motion history of the human beings, were
detected and tracked. By using the spatial
relationship and the handcrafted features,
falls on furniture were effectively detected.
The method not only detected falls on
furniture, such as chairs and sofas, but also
distinguished them from other similar
activities.


Shoja et al. described a model for human
fall detection using a Recurrent Neural
Network (RNN) [20]. Every time a fall was
observed in the video, an alarm was
generated and a message was immediately
sent to the medical staff. The RNN could
not be trained well due to the "vanishing
gradient" problem. To overcome this issue,
Long Short-Term Memory (LSTM) was
used. In 2019, Espinosa et al. [21] presented
a multi-camera vision-based fall detection
system. They took advantage of the visual
feature extraction of a CNN by using the
optical flow method. In the same year,
Santos et al. considered an IoT and Fog
Computing environment for human fall
detection [6]. They proposed a CNN
composed of three convolutional layers,
two max-pooling layers, and three fully
connected layers as their deep learning
model. Accuracy, sensitivity, specificity,
precision, and the Matthews Correlation
Coefficient were used to evaluate the
performance. Much better results were
attained when data augmentation was used
during the training process.
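
The evaluation measures named in this section (accuracy, sensitivity, specificity, precision, and the Matthews Correlation Coefficient) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard definitions, not code from any of the cited works; the function name fall_metrics and the example counts are illustrative assumptions.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def fall_metrics(tp, tn, fp, fn):
    """Compute binary-classification measures for a fall detector.

    tp / fn: fall frames that were detected / missed.
    tn / fp: no-fall frames that were passed / falsely flagged.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # recall on the fall class
        "specificity": _ratio(tn, tn + fp),   # recall on the no-fall class
        "precision": _ratio(tp, tp + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for 100 fall frames and 100 no-fall frames.
metrics = fall_metrics(tp=95, tn=98, fp=2, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters for fall detection because a missed fall and a false alarm have very different costs.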


In 2020, Han et al. proposed a two-stream
approach for fall detection, constructing
a lightweight network model by replacing a
conventional convolutional neural network
with a simplified one [22]. MobileVGG was
put forward as a lightweight network model
that replaces the conventional convolution
with disentangled and efficient
combinations of pointwise convolution and
depthwise convolution.
Cameron et al. applied a multistream
learning model based on a CNN to the
two-category human fall detection problem
[23]. Their approach consisted of
extracting high-level handcrafted features,
for instance, human pose estimation and
optical flow, and using each one of them as
an input to a separate VGG-16 classifier.
These experiments also show which
features can be used in fall detection. The
results showed that, by combining the
given input learners, the approach achieved
higher sensitivity and accuracy.
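
The saving obtained by replacing a conventional convolution with a depth convolution followed by a point convolution, as described for MobileVGG, can be illustrated by counting weights. This is a sketch of the general factorization only; the function names and the layer size (3x3 kernel, 256 input and output channels) are illustrative assumptions and are not taken from [22]. Bias terms are ignored.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Weights of a conventional convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel, c_in, c_out):
    """Weights of a depth convolution (one k x k filter per input channel)
    followed by a point (1 x 1) convolution that mixes the channels."""
    depth_part = kernel * kernel * c_in
    point_part = c_in * c_out
    return depth_part + point_part


# Assumed example layer: 3x3 kernel, 256 input channels, 256 output channels.
std = standard_conv_weights(3, 256, 256)
sep = separable_conv_weights(3, 256, 256)
ratio = sep / std

print(f"standard:  {std} weights")
print(f"separable: {sep} weights")
print(f"ratio:     {ratio:.3f}")
```

For a k x k kernel the ratio equals 1/c_out + 1/k^2, so a 3x3 layer with many output channels needs roughly one ninth of the weights.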


III. PROPOSED METHODOLOGY


Visual Geometry Group (VGG-16), a deep
network with 13 convolutional and 3 fully
connected layers, is specially designed
for image classification [24]. VGG-16
introduced the idea of an effective receptive
field and of a simple and homogeneous
topology. However, it also uses
computationally expensive fully connected
layers and, as a result, requires more
computational power and memory.

FIGURE 3. Methodology


VGG-16 falls under the spatial exploitation
category and uses a fixed input image size
of 224x224. The spatial resolution of the
input is preserved after every convolution,
that is, 1-pixel padding is used for the 3x3
convolutional layers. Different deep
architectures rely on different depths of
convolutional layers. In VGG-16, the stack
of convolutional layers is followed by three
fully connected layers, of which the first
two have 4096 channels each.

The third layer performs the 1000-way
ILSVRC classification and therefore
contains 1000 channels, one for each class.
The final layer of the architecture is the
SoftMax layer. The same configuration is
used for all fully connected layers present
in the network [26]. Equations 1-3 give the
output width, output height, and output
matrix of VGG-16.


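$$\mathrm{OutputWidth} = \frac{W - FW + 2P}{Sw} + 1 \qquad (1)$$

$$\mathrm{OutputHeight} = \frac{H - Fh + 2P}{Sh} + 1 \qquad (2)$$

The output matrix is used to calculate the
pooling layer. The formula is given below:

$$OM = \frac{IM + 2P - FW}{Sw} + 1 \qquad (3)$$

where,

Number of filters
+ W = Input width
+ H = Input height
+ FW = Filter width
+ Fh = Filter height
+ Sh = Stride height
+ Sw = Stride width
+ P = Padding
+ IM = Input matrix
+ OM = Output matrix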

The input fall videos were split into
frames, and each frame was then examined
for the stage of the fall it shows. Some
frames from the fall videos are shown in
Fig II [26].
TABLE I
VARIOUS PARAMETERS OF VGG-16

Methods                        VGG-16
Top-5 error
Input size                     224x224
Convolutional layers           16
Filter size                    3
Feature maps                   3-512
Stride                         1
Weights (convolutional)        14.7M
MACs (convolutional)           15.3G
FC layers                      3
Weights (FC)                   124M
MACs (FC)                      124M
Total weights                  138M
Total MACs                     15.5G




The VGG-16 architecture is shown in
Fig II [3]. It shows the different
convolutional layers and how these layers
interact with each other. The input of
convolutional layer 1 consists of 224x224
RGB images that pass through various
layers, including the convolutional layer
stack and the spatial pooling layers [27].
The parameters of VGG-16 are shown in
Table I [28].

Deep Residual Network (ResNet) is an
artificial neural network (ANN) developed
to address the low accuracy observed when
building a plain ANN with deeper layers. In
other words, the objective of a Deep
Residual Network is to make a highly
accurate ANN with deeper layers. The
principle of a Deep Residual Network is to
create an ANN that can still update the
weights of its deeper layers [29].
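
The usual mechanism behind this principle is the identity shortcut: a block outputs F(x) + x instead of F(x), so a block whose learned branch contributes nothing simply passes its input through. The sketch below illustrates that idea on plain Python lists. It follows the general residual-learning formulation, not the ResNet-50 or ResNet-101 implementations used in the experiments, and all names in it are hypothetical.

```python
def residual_block(x, branch):
    """Return branch(x) + x, i.e. the branch output plus an identity shortcut."""
    fx = branch(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x):
    """A branch whose weights are all zero, so it contributes nothing."""
    return [0.0 for _ in x]


def small_branch(x):
    """A stand-in for a learned transformation: a small scaled copy of x."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]

# Stacking many blocks with zero branches leaves the signal unchanged,
# so adding depth cannot by itself make the output worse.
out = x
for _ in range(50):
    out = residual_block(out, zero_branch)

# A non-zero branch only has to model the change relative to its input.
refined = residual_block(x, small_branch)

print(out)
print(refined)
```

In a plain (non-residual) stack, the same zero-weight layers would map every input to zero, which is one way to see why very deep plain networks are harder to train.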


FIGURE 5. Keyframes from MCF data showing human fall
A. DATA SETS


Frames are extracted from different fall
and no-fall videos, and the extracted images
are resized to 128x128. The UR and MCF
data sets are combined into one data set
containing a total of 2000 frames. Out of
the total, 1000 frames contain a fall, while
the remaining frames contain no fall.


The fall sequences of the UR data set
consist of depth and RGB data, which are
further labeled as category 0 (No Fall) or
category 1 (Fall) [30], [31]. The second
data set is the Multiple Camera Fall data
set, which contains 24 different scenarios
[26]. The data set contains falls of various
types, including forward fall, backward
fall, slip, falling on furniture, and falling on
the floor.
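
The combination of the two data sets into one balanced, labeled set can be sketched as follows. This is an illustration only: the frame identifiers are placeholders, and the assumption that each data set contributes 500 fall and 500 no-fall frames is hypothetical, since only the combined totals (2000 frames, 1000 per class) are stated above. Frame extraction and resizing are outside the scope of this sketch.

```python
import random
from collections import Counter

NO_FALL, FALL = 0, 1  # category labels used for the frames


def combine_datasets(*datasets, seed=0):
    """Merge lists of (frame_id, label) pairs and shuffle them reproducibly.

    Returns the merged list and a Counter with the number of frames per label.
    """
    merged = [sample for dataset in datasets for sample in dataset]
    random.Random(seed).shuffle(merged)
    counts = Counter(label for _, label in merged)
    return merged, counts


# Placeholder frame lists; the 500/500 split per data set is an assumption.
urfd = [(f"urfd_{i:04d}", FALL if i < 500 else NO_FALL) for i in range(1000)]
mcf = [(f"mcf_{i:04d}", FALL if i < 500 else NO_FALL) for i in range(1000)]

frames, counts = combine_datasets(urfd, mcf)
print(len(frames), counts[FALL], counts[NO_FALL])
```

Checking the class counts after merging is a cheap safeguard, because accuracy is only a fair summary when the fall and no-fall classes are balanced.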


IV. EXPERIMENTS AND DISCUSSION


The current study compared the proposed
approach (PA), that is, VGG-16, with
existing state-of-the-art networks,
including ResNet-50 and ResNet-101, in
terms of percentage accuracy and time
efficiency.


A. EXPERIMENT 1: COMPARISON IN 
TERMS OF ACCURACY 


VGG-16 outperforms ResNet-50 and
ResNet-101 in terms of percentage
accuracy, as seen in Figure 6 and Table II.
Because the MCF and URFD datasets are
combined into a single dataset, the data
show high intraclass variation. PA showed
excellent results on such highly diverse
data.


As shown in Table I, the input size was
224x224, with 16 convolutional layers of
filter size 3. The number of feature maps
ranged from 3 to 512, and the corresponding
numbers of weights and MACs were 14.7M
and 15.3G, respectively. The 3 FC layers
then added 124M weights and 124M MACs,
respectively. The total weights were 138M
and the total MACs were 15.5G.
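
The weight and MAC figures quoted here can be checked with a short calculation. The sketch below assumes the standard VGG-16 configuration with 13 convolutional layers (3x3 kernels, stride 1, 1-pixel padding), five 2x2 max-pooling steps, and three fully connected layers of 4096, 4096, and 1000 units, as described in Section III. The names VGG16_CFG, FC_SIZES, and count_vgg16 are hypothetical, and bias terms are counted as weights but not as MACs.

```python
# Output channels of the 13 convolutional layers; "M" marks 2x2 max pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]


def count_vgg16(input_size=224, in_channels=3, kernel=3):
    """Return (conv_weights, conv_macs, fc_weights, fc_macs) for VGG-16."""
    size, channels = input_size, in_channels
    conv_weights = conv_macs = 0
    for item in VGG16_CFG:
        if item == "M":
            size //= 2  # pooling halves the spatial resolution
            continue
        per_output_pixel = kernel * kernel * channels * item
        conv_weights += per_output_pixel + item   # filters plus biases
        conv_macs += size * size * per_output_pixel
        channels = item

    fc_weights = fc_macs = 0
    features = size * size * channels             # flattened feature map
    for out_features in FC_SIZES:
        fc_weights += features * out_features + out_features
        fc_macs += features * out_features
        features = out_features
    return conv_weights, conv_macs, fc_weights, fc_macs


conv_w, conv_m, fc_w, fc_m = count_vgg16()
print(f"conv weights: {conv_w / 1e6:.1f}M, conv MACs: {conv_m / 1e9:.1f}G")
print(f"FC weights:   {fc_w / 1e6:.0f}M, FC MACs:   {fc_m / 1e6:.0f}M")
print(f"total weights: {(conv_w + fc_w) / 1e6:.0f}M, "
      f"total MACs: {(conv_m + fc_m) / 1e9:.1f}G")
```

The calculation shows that the fully connected layers hold most of the weights, while almost all of the MACs come from the convolutional layers.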


TABLE II
COMPARISON OF PA WITH RESNET-50
AND RESNET-101 IN TERMS OF
PERCENTAGE ACCURACY AND
TIME EFFICIENCY


             ResNet-50   ResNet-101   PA
Accuracy     91%         96.11%       98.05%
Time (sec)


B. EXPERIMENT 2: COMPARISON IN 
TERMS OF EXECUTION TIME 


It can be seen in Figure 7 that the execution
time of PA is comparable with that of
ResNet-101. However, ResNet-50 shows
better time efficiency than PA at the cost of
its percentage accuracy. Considering this
performance of PA in terms of time, it can
be expected that this approach can later be
used in real-time systems.
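
A comparison of execution time such as this one depends on how the time is measured. The sketch below shows one simple way to estimate the time per frame with the Python standard library. It is not the measurement procedure used in this study: the function time_per_frame, the stand-in dummy_predict, and the synthetic frames are all hypothetical, and a real measurement would call the trained network instead.

```python
import time


def time_per_frame(predict, frames, repeats=3):
    """Return the best-of-repeats average time (in seconds) to process one frame."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for frame in frames:
            predict(frame)
        best = min(best, time.perf_counter() - start)
    return best / len(frames)


def dummy_predict(frame):
    """Stand-in for a classifier; returns 1 (fall) or 0 (no fall)."""
    return 1 if sum(frame) > 0 else 0


# Synthetic "frames": 100 flat lists of 1000 pixel values each.
frames = [[0.0] * 1000 for _ in range(100)]

seconds = time_per_frame(dummy_predict, frames)
fps = 1.0 / seconds if seconds > 0 else float("inf")
print(f"{seconds * 1e3:.3f} ms per frame, about {fps:.0f} frames per second")
```

Taking the best of several repeats reduces the influence of background load, and expressing the result in frames per second makes it easy to compare against the frame rate of the camera when judging real-time suitability.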




FIGURE 6. Comparison of PA with 
Existing Deep Networks in terms of 
Percentage Accuracy 
FIGURE 7. Comparison of VGG-16 with
Existing Deep Networks in terms of Time 
Efficiency 

V. CONCLUSION AND FUTURE WORK


The current study compared the
performance of VGG-16 with other deep
learning architectures. It concluded that
the proposed architecture, VGG-16,
exceeded the others in terms of both
percentage accuracy and time efficiency.
The work can be further extended to
achieve higher accuracy on datasets
generated in an uncontrolled environment.
In the future, deep algorithms are expected
to continue to show more promising results.
The addition of a large data repository for
fall detection would help in further
improving the results of the proposed
algorithm. Human fall detection in dim
light, under occlusion, and with camouflage
needs to be further investigated. A robust
real-time system that shows even better
percentage accuracy and works in an
uncontrolled environment is the need of the
hour.